Ray tracing has several potential advantages over current raster-based techniques. Ray tracing's complexity grows logarithmically with model size, while raster's complexity grows linearly. Ray tracing also naturally implements shadows, translucency, and reflectivity without the hacks used in raster techniques today. Geometric primitives such as spheres can also be modeled directly without being broken into triangle meshes.
The 8 SPEs (Synergistic Processing Elements) are optimized for SIMD (single instruction, multiple data) processing. Each has 128 128-bit registers, which can store vector datatypes (such as a vector of 4 floats for {x,y,z,w}) and be used in vector operations (such as vector multiply). Like the PowerPC processor, they lack hardware branch prediction and out-of-order execution. Unlike the PowerPC processor, they have no caches and no direct access to system memory. Instead, each has a 256 KB local store used for both program and data. System memory is accessed through explicit asynchronous DMA requests, and each SPE may have multiple requests outstanding. With a high-speed bus connecting the processors inside the chip, this system is optimized for stream-style computing.
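To make the idea of a 4-float vector datatype concrete, here is a minimal sketch. It uses GCC's generic vector extensions as a portable stand-in, so it compiles on an ordinary PC; actual SPE code would use the `vector float` type and the intrinsics shipped with the Cell SDK instead. The names `v4f`, `dot4`, and `ray_point` are made up for illustration and are not taken from the ray tracer's source.

```c
#include <assert.h>

/* Portable stand-in for a 128-bit register holding 4 floats {x,y,z,w}.
   This is the GCC/Clang generic vector extension, an assumption made so
   the sketch builds off the Cell; it is not the SPE-specific type. */
typedef float v4f __attribute__((vector_size(16)));

static_assert(sizeof(v4f) == 16, "v4f must fill one 128-bit register");

/* Dot product: one element-wise vector multiply, then a horizontal
   sum of the four lanes. */
static float dot4(v4f a, v4f b)
{
    v4f p = a * b;
    return p[0] + p[1] + p[2] + p[3];
}

/* Point along a ray at distance t: origin + t * dir, computed on all
   four lanes at once. */
static v4f ray_point(v4f origin, v4f dir, float t)
{
    v4f tt = { t, t, t, t };
    return origin + tt * dir;
}
```

Note that `dot4` still needs a horizontal sum across lanes, which is the kind of step that batching several rays together can avoid.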
On the PlayStation 3, seven of the eight SPEs are active; the last one failed verification (if it had passed, the chip would have gone into an IBM blade system instead :-). One of the seven runs a dedicated hypervisor, so 6 SPEs are available to a Linux application. The program's main function starts on the PowerPC and can make normal Linux system calls. The PowerPC program can start threads on the SPEs and communicate with them through hardware mailbox message passing, hardware signals, or shared memory (accessed via DMA on the SPE side).
My program was developed in C using GCC directly on the PS3. I did not try the IBM XL C/C++ compiler, as it requires cross-compilation on an x86 Fedora 5 Linux machine (which I don't have). Developing directly on the PS3 was OK; it supports Gnome on X-Windows and runs at a tolerable resolution (720p) with PS3 component cables to the monitor. I attached a USB mouse and keyboard. Wired Ethernet works, but the PS3 Wi-Fi is not supported under Linux. Gnome is slow with only 256 MB usable by Linux and no hardware graphics acceleration, and the full install of Fedora 5 occupies 95% of the 10 GB Linux partition on the hard drive. Deleting OpenOffice, etc. would free up a lot of space...
Versions of GCC are used to produce executable code for both the PowerPC and SPEs. The C language has been extended to support the vector types and operators, and additional libraries are provided for basic vector math and other operations.
The work is divided between the PowerPC and the SPEs. The PowerPC initializes SDL (Simple DirectMedia Layer), which provides a simple framebuffer API on top of both X-Windows and windowless Linux. It starts a thread on each SPE and sends the current X coordinate of the red sphere (controlled by keyboard input) in a mailbox message to each SPE. It then busy-waits for a mailbox reply from each SPE. When they have all replied, it copies the pixelColor array (filled by the SPEs) to the SDL frame buffer and flips buffers. It then checks for keyboard input and sends new mailbox messages to each SPE.
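The per-frame handshake described above can be sketched as follows. This is a simulation under stated assumptions, not the real program: the `sim_spe` struct and the functions `send_frame`, `sim_spe_run`, and `replies_ready` are hypothetical stand-ins for the hardware mailboxes, and real code would use the mailbox calls from the Cell SDK.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SPES 6  /* SPEs available to a Linux application, per the text */

/* Hypothetical stand-in for one SPE's mailboxes. */
typedef struct {
    uint32_t sphere_x;  /* last red sphere X coordinate received */
    int      has_work;  /* inbound mailbox holds an unread message */
    int      replied;   /* outbound "frame done" reply is waiting */
} sim_spe;

/* PowerPC side: send the red sphere X coordinate to every SPE. */
static void send_frame(sim_spe spes[NUM_SPES], uint32_t sphere_x)
{
    for (int i = 0; i < NUM_SPES; i++) {
        spes[i].sphere_x = sphere_x;
        spes[i].has_work = 1;
        spes[i].replied  = 0;
    }
}

/* SPE side (simulated): consume the message, "render", then reply. */
static void sim_spe_run(sim_spe *s)
{
    assert(s);
    if (s->has_work) {
        s->has_work = 0;
        s->replied  = 1;
    }
}

/* PowerPC side: count the replies received so far. The frame may be
   copied to the frame buffer only once this reaches NUM_SPES. */
static int replies_ready(const sim_spe spes[NUM_SPES])
{
    int n = 0;
    for (int i = 0; i < NUM_SPES; i++)
        n += spes[i].replied;
    return n;
}
```

In this sketch the PowerPC would loop on `replies_ready(spes) < NUM_SPES` before copying pixels, which mirrors the busy wait in the text.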
The SPEs are initialized with an offset equal to their id (0-5). Each renders one screen line (1024 pixels) at a time into its local store, then transfers it by DMA to the appropriate location in the pixelColor array shared with the PowerPC. Each SPE has two line-sized buffers, so it can start rendering the next line while the previous one is queued for DMA transfer. Each SPE's next line is 6 lines below its previous one, so complex parts of the scene will not overload a single SPE. That is, SPE 0 renders lines 0, 6, 12, ...; SPE 1 renders lines 1, 7, 13, ...; and so on. When the DMA transfers for all of its lines in the screen have completed, the SPE sends a mailbox message to the PowerPC and awaits a reply with the new red sphere X coordinate. The white spheres and light source are animated locally, since all SPEs are globally synchronized per frame by the PowerPC.
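The line interleaving and double buffering described above come down to a little index arithmetic, sketched here. The function names are hypothetical, and the row-by-row layout of pixelColor is an assumption; the values 6 and 1024 come from the text.

```c
#include <assert.h>

#define NUM_SPES   6     /* SPEs available to Linux, per the text */
#define LINE_WIDTH 1024  /* pixels per screen line, per the text */

/* Screen line that SPE `spe_id` renders on its n-th pass (n = 0, 1, ...). */
static int line_for(int spe_id, int n)
{
    assert(spe_id >= 0 && spe_id < NUM_SPES && n >= 0);
    return spe_id + n * NUM_SPES;
}

/* Number of lines SPE `spe_id` renders on a screen `height` lines tall. */
static int lines_for_spe(int spe_id, int height)
{
    assert(spe_id >= 0 && spe_id < NUM_SPES && height >= 0);
    if (spe_id >= height)
        return 0;
    return (height - spe_id + NUM_SPES - 1) / NUM_SPES;
}

/* Which of the two line buffers pass n renders into; the other buffer
   may still be queued for DMA from pass n - 1. */
static int buffer_for(int n)
{
    return n & 1;
}

/* Index of a line's first pixel in the shared pixelColor array,
   assuming the array is stored row by row (an assumption). */
static long pixel_offset(int line)
{
    return (long)line * LINE_WIDTH;
}
```

For example, `line_for(1, 2)` gives 13, matching the sequence 1, 7, 13 above. The screen height is not stated in the text, so any value passed as `height` is a guess.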
At thread start, all objects in the scene are initialized in the local store of each SPE, so no DMA is needed to load objects during processing. With simple spheres, up to 2048 objects can fit in memory (1000 objects ran at 0.34 frames per second). Other Cell ray tracers (discussed below) have implemented software caches (plus software hyperthreading) to load objects and spatial index nodes dynamically. Having to manage this in software is a limitation compared to conventional CPUs, which automate the process with their L1 and L2 caches and transparent access to system memory.
My program is implemented as both a generic version, which compiles on any Linux system, and a Cell-specific version. The first two bars compare the generic version on a 2.2 GHz Athlon and on the 3.2 GHz PowerPC in the Cell. Here we see that the Cell PowerPC is much less powerful than the Athlon. Note that the Athlon is actually dual-core (and the PowerPC has two hardware threads), but the generic code doesn't use multiple threads. The generic code uses the software vector library from the Graphics Gems texts. The Cell-specific version uses the vector datatypes and operators available in the SPE hardware, and vector libraries provided by IBM. The switch from Graphics Gems software vectors to SPE hardware vectors provided a 62% speedup. Without this speedup a single SPE would be slower than the PowerPC alone; the SPE is designed to be less efficient than the PowerPC on non-vector operations. Note that the Athlon has similar SSE3 hardware vector operations available; the generic code does not use them.
Other Cell ray tracers (discussed below) have achieved significant pipeline speedups by converting their vector operations from AOS (Array of Structures) to SOA (Structure of Arrays) form and batching multiple rays together; this transformation is complicated because it involves branches (where some rays intersect an object while others don't).
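For readers unfamiliar with the AOS-to-SOA idea, here is a minimal sketch in plain C. It is an illustration only and does not reproduce how the other Cell ray tracers are written; the types `vec3` and `vec3x4` and all function names are hypothetical. The batch size of 4 is chosen to match the four float lanes of a 128-bit register.

```c
#include <assert.h>

#define BATCH 4  /* rays per batch: one per float lane of a 128-bit register */

/* AOS: one structure per vector, components interleaved in memory. */
typedef struct { float x, y, z; } vec3;

/* SOA: one array per component, so lane i of every array belongs to ray i. */
typedef struct { float x[BATCH], y[BATCH], z[BATCH]; } vec3x4;

/* Repack four AOS vectors into SOA form. */
static vec3x4 aos_to_soa(const vec3 v[BATCH])
{
    vec3x4 out;
    for (int i = 0; i < BATCH; i++) {
        out.x[i] = v[i].x;
        out.y[i] = v[i].y;
        out.z[i] = v[i].z;
    }
    return out;
}

/* Four dot products at once: every operation is lane-wise, so no
   horizontal sum across a register is needed. */
static void dot4_soa(const vec3x4 *a, const vec3x4 *b, float out[BATCH])
{
    assert(a && b && out);
    for (int i = 0; i < BATCH; i++)
        out[i] = a->x[i] * b->x[i] + a->y[i] * b->y[i] + a->z[i] * b->z[i];
}

/* The "did this ray hit?" branch becomes a per-lane select: take the
   new distance where the ray hit and is nearer, keep the old one elsewhere. */
static void update_nearest(float nearest[BATCH], const float t[BATCH],
                           const int hit[BATCH])
{
    for (int i = 0; i < BATCH; i++)
        nearest[i] = (hit[i] && t[i] < nearest[i]) ? t[i] : nearest[i];
}
```

The select in `update_nearest` shows why the branches complicate things: all four rays do the same work whether or not they hit, and the results are merged afterwards.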
Note that both the graph above and the one below indicate we are CPU-bound on the SPEs.
My code is available here.
IBM has changed the way SPU threads are initialized and controlled; they now expose a pthreads interface. My source code now includes both old (LIBSPE 1 / SDK 2.0) and new (LIBSPE 2 / SDK 2.1) versions. No performance difference was seen between LIBSPE 1 and 2.